Hybrid Language Segmentation for Historical Documents

Authors

  • David Alfter
  • Yuri Bizzoni
Abstract

English. Language segmentation, i.e. the division of a multilingual text into monolingual fragments, has been addressed in the past, but its application to historical documents has remained largely unexplored. We propose a method for language segmentation of multilingual historical documents. For documents that contain a mix of high- and low-resource languages, we leverage the high availability of high-resource language material and use unsupervised methods for the low-resource parts. We show that our method outperforms previous efforts in this field.

Italiano. La segmentazione del linguaggio, la divisione di un testo multilingue in frammenti monolingue, è stata affrontata nel passato, ma la sua applicazione a documenti storici è rimasta in gran parte inesplorata. Proponiamo un metodo per la segmentazione linguistica di documenti storici multilingue. Per documenti che contengono sia lingue ad alta disponibilità di risorse che lingue sottorappresentate, utilizziamo a nostro vantaggio l’elevata disponibilità delle lingue con un’ampia gamma di risorse e impieghiamo sistemi non supervisionati per le parti che dispongono di un minor numero di risorse. Mostriamo che il nostro metodo supera gli sforzi precedenti in questo settore.

Similar resources

A Holistic Methodology for Keyword Search in Historical Typewritten Documents

In this paper, we propose a novel holistic methodology for keyword search in historical typewritten documents combining synthetic data and user's feedback. The holistic approach treats the word as a single entity and entails the recognition of the whole word rather than of individual characters. Our aim is to search for keywords typed by the user in a large collection of digitized typewritten h...

Persian Printed Document Analysis and Page Segmentation

This paper presents a hybrid method, combining low-resolution and high-resolution analysis, for Persian page segmentation. In the low-resolution page segmentation, a pyramidal image structure is constructed for multiscale analysis, and the document image is segmented into a set of regions. In the high-resolution page segmentation, connected components analysis is used to segment each region into homogeneous regions and identifyi...

Recognition the Sociological and Architectural Components based on Geographical Segmentation Technique by Value-normative Paradigm

A house, as a primary dwelling, is designed according to the lifestyle and current values in the life and mind of its residents. A house is a cultural element containing cultural meanings situated in its spirit, which distinguish its form from that of other houses. The particular lifestyle and conduct of the residents become values over time. This value organizes the meaning in the mind and determines meaning of li...

Segmentation of Handwritten Characters for Digitalizing Korean Historical Documents

The historical documents are valuable cultural heritages and sources for the study of history, social aspect and life at that time. The digitalization of historical documents aims to provide instant access to the archives for the researchers and the public, who had been endowed with limited chance due to maintenance reasons. However, most of these documents are not only written by hand in ancie...

Hybrid Segmentation Prototype for Arabic Text-Based Documents: Towards Plagiarism Detection

The contribution of this work relates to the field of Arabic text-based document analysis for the detection of plagiarism. This analysis will be carried out according to the triadic computation model of document similarity. The authors propose a hybrid segmentation prototype for Arabic text-based documents that links different processing steps in order to generate the similarity rate between th...

Publication date: 2016